NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

The Rich and the Simple: On the Implicit Bias of Adam and SGD

Vasudeva, Bhavya; Lee, Jung W; Sharan, Vatsal; Soltanolkotabi, Mahdi (December 2025, Advances in Neural Information Processing Systems)

Free, publicly-accessible full text available December 1, 2026
EMERGENCE AND EVOLUTION OF INTERPRETABLE CONCEPTS IN DIFFUSION MODELS

Tinaz, Berk; Fabian, Zalan; Soltanolkotabi, Mahdi (November 2025, Proceedings of Neural Information Processing (Neurips 2025))

Diffusion models have become the go-to method for text-to-image generation, producing high-quality images from noise through a process called reverse diffusion. Understanding the dynamics of the reverse diffusion process is crucial in steering the generation and achieving high sample quality. However, the inner workings of diffusion models is still largely a mystery due to their black-box nature and complex, multi-step generation process. Mechanistic Interpretability (MI) techniques, such as Sparse Autoencoders (SAEs), aim at uncovering the operating principles of models through granular analysis of their internal representations. These MI techniques have been successful in understanding and steering the behavior of large language models at scale. However, the great potential of SAEs has not yet been applied toward gaining insight into the intricate generative process of diffusion models. In this work, we leverage the SAE framework to probe the inner workings of a popular text-to-image diffusion model, and uncover a variety of human-interpretable concepts in its activations. Interestingly, we find that even before the first reverse diffusion step is completed, the final composition of the scene can be predicted surprisingly well by looking at the spatial distribution of activated concepts. Moreover, going beyond correlational analysis, we show that the discovered concepts have a causal effect on the model output and can be leveraged to steer the generative process. We design intervention techniques aimed at manipulating image composition and style, and demonstrate that (1) in early stages of diffusion image composition can be effectively controlled, (2) in the middle stages of diffusion image composition is finalized, however stylistic interventions are effective, and (3) in the final stages of diffusion only minor textural details are subject to change.
more » « less
Free, publicly-accessible full text available November 1, 2026
Test-Time Training Provably Improves Transformers as In-context Learners

Gozeten, Halil Alperen; Ildiz, Emrullah; Zhang, Xuechen; Soltanolkotabi, Mahdi; Mondelli, Marco; Oymak, Samet (July 2025, Proceedings of International conference on Machine Larning (ICML 2025))

est-time training (TTT) methods explicitly update the weights of a model to adapt to the specific test instance, and they have found success in a variety of settings, including most recently language modeling and reasoning. To demystify this success, we investigate a gradient-based TTT algorithm for in-context learning, where we train a transformer model on the in-context demonstrations provided in the test prompt. Specifically, we provide a comprehensive theoretical characterization of linear transformers when the update rule is a single gradient step. Our theory (i) delineates the role of alignment between pretraining distribution and target task, (ii) demystifies how TTT can alleviate distribution shift, and (iii) quantifies the sample complexity of TTT including how it can significantly reduce the eventual sample size required for in-context learning. As our empirical contribution, we study the benefits of TTT for TabPFN, a tabular foundation model. In line with our theory, we demonstrate that TTT significantly reduces the required sample size for tabular classification (3 to 5 times fewer) unlocking substantial inference efficiency with a negligible training cost.
more » « less
Free, publicly-accessible full text available July 31, 2026
Test-Time Training Provably Improves Transformers as In-context Learners

Gozeten, Halil A; Ildiz, Emrullah; Zhang, Xuechen; Soltanolkotabi, Mahdi; Mondelli, Marco; Oymak, Samet (July 2025, Proceedings of the 42nd International Conference on Machine Learning)

Free, publicly-accessible full text available July 15, 2026
Adapt and Diffuse: Sample-adaptive Reconstruction via Latent Diffusion Models

Fabian, Zalan; Tinaz, Berk; Soltanolkotabi, Mahdi (July 2024, Proceedings of International Conference on Machine Learning Research (ICML))

Inverse problems arise in a multitude of applications, where the goal is to recover a clean signal from noisy and possibly (non)linear observations. The difficulty of a reconstruction problem depends on multiple factors, such as the ground truth signal structure, the severity of the degradation and the complex interactions between the above. This results in natural sample-by-sample variation in the difficulty of a reconstruction problem. Our key observation is that most existing inverse problem solvers lack the ability to adapt their compute power to the difficulty of the reconstruction task, resulting in subpar performance and wasteful resource allocation. We propose a novel method, severity encoding, to estimate the degradation severity of corrupted signals in the latent space of an autoencoder. We show that the estimated severity has strong correlation with the true corruption level and can provide useful hints on the difficulty of reconstruction problems on a sample-by-sample basis. Furthermore, we propose a reconstruction method based on latent diffusion models that leverages the predicted degradation severities to fine-tune the reverse diffusion sampling trajectory and thus achieve sample-adaptive inference times. Our framework, Flash-Diffusion, acts as a wrapper that can be combined with any latent diffusion-based baseline solver, imbuing it with sample-adaptivity and acceleration. We perform experiments on both linear and nonlinear inverse problems and demonstrate that our technique greatly improves the performance of the baseline solver and achieves up to 10× acceleration in mean sampling speed.
more » « less
Full Text Available
DIRACDIFFUSION: DENOISING AND INCREMENTAL RECONSTRUCTION WITH ASSURED DATA-CONSISTENCY

Fabian, Zalan; Tinaz, Berk; Soltanolkotabi, Mahdi (July 2024, Proceedings of International Conference on Machine Learning Research (ICML))

Diffusion models have established new state of the art in a multitude of computer vision tasks, in- cluding image restoration. Diffusion-based inverse problem solvers generate reconstructions of ex- ceptional visual quality from heavily corrupted measurements. However, in what is widely known as the perception-distortion trade-off, the price of perceptually appealing reconstructions is often paid in declined distortion metrics, such as PSNR. Distortion metrics measure faithfulness to the observation, a crucial requirement in inverse problems. In this work, we propose a novel framework for inverse problem solving, namely we assume that the observation comes from a stochastic degra- dation process that gradually degrades and noises the original clean image. We learn to reverse the degradation process in order to recover the clean image. Our technique maintains consistency with the original measurement throughout the reverse process, and allows for great flexibility in trading off perceptual quality for improved distortion metrics and sampling speedup via early-stopping. We demonstrate the efficiency of our method on different high-resolution datasets and inverse problems, achieving great improvements over other state-of-the-art diffusion-based methods with respect to both perceptual and distortion metrics. Source code and pre-trained models will be released soon.
more » « less
Full Text Available
Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks

Collins, Liam; Hassani, Hamed; Soltanolkotabi, Mahdi; Mokhtari, Aryan; Shakkottai, Sanjay (July 2024, Proceedings of International Conference on Machine Learning Research (ICML))

An increasingly popular machine learning paradigm is to pretrain a neural network (NN) on many tasks offline, then adapt it to downstream tasks, often by re-training only the last linear layer of the network. This approach yields strong downstream performance in a variety of contexts, demonstrating that multitask pretraining leads to effective feature learning. Although several recent theoretical studies have shown that shallow NNs learn meaningful features when either (i) they are trained on a single task or (ii) they are linear, very little is known about the closer-to-practice case of nonlinear NNs trained on multiple tasks. In this work, we present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks. Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks. Using this observation, we show that when the tasks are binary classification tasks with labels depending on the projection of the data onto an r-dimensional subspace within the d k r-dimensional input space, a simple gradient-based multitask learning algorithm on a two-layer ReLU NN recovers this projection, allowing for generalization to downstream tasks with sample and neuron complexity independent of d. In contrast, we show that with high probability over the draw of a single task, training on this single task cannot guarantee to learn all r ground-truth features.
more » « less
Full Text Available
Learning from many trajectories

Tu, Stephen; Frostig, Roy; Soltanolkotabi, Mahdi (April 2024, Journal on Machine Learning Research)

We initiate a study of supervised learning from many independent sequences ("trajectories") of non-independent covariates, reflecting tasks in sequence modeling, control, and reinforcement learning. Conceptually, our multi-trajectory setup sits between two traditional settings in statistical learning theory: learning from independent examples and learning from a single auto-correlated sequence. Our conditions for efficient learning generalize the former setting--trajectories must be non-degenerate in ways that extend standard requirements for independent examples. Notably, we do not require that trajectories be ergodic, long, nor strictly stable. For linear least-squares regression, given n-dimensional examples produced by m trajectories, each of length T, we observe a notable change in statistical efficiency as the number of trajectories increases from a few (namely m<=n) to many (namely m>=n). Specifically, we establish that the worst-case error rate of this problem is n/(mT) whenever m>=n. Meanwhile, when m<=n, we establish a (sharp) lower bound of n^2/(m^2T) on the worst-case error rate, realized by a simple, marginally unstable linear dynamical system. A key upshot is that, in domains where trajectories regularly reset, the error rate eventually behaves as if all of the examples were independent, drawn from their marginals. As a corollary of our analysis, we also improve guarantees for the linear system identification problem.
more » « less
Full Text Available
Learning Provably Robust Estimators for Inverse Problems via Jittering

Krainovic, Anselm; Soltanolkotabi, Mahdi; Heckel, Reinhard (December 2023, Proceedings of Neural Information Processing Systems)

Deep neural networks provide excellent performance for inverse problems such as denoising. However, neural networks can be sensitive to adversarial or worst-case perturbations. This raises the question of whether such networks can be trained efficiently to be worst-case robust. In this paper, we investigate whether jittering, a simple regularization technique that adds isotropic Gaussian noise during training, is effective for learning worst-case robust estimators for inverse problems. While well studied for prediction in classification tasks, the effectiveness of jittering for inverse problems has not been systematically investigated. In this paper, we present a novel analytical characterization of the optimal -worst-case robust estimator for linear denoising and show that jittering yields optimal robust denoisers. Furthermore, we examine jittering empirically via training deep neural networks (U-nets) for natural image denoising, deconvolution, and accelerated magnetic resonance imaging (MRI). The results show that jittering significantly enhances the worst-case robustness, but can be suboptimal for inverse problems beyond denoising. Moreover, our results imply that training on real data which often contains slight noise is somewhat robustness enhancing.
more » « less
Full Text Available
Don't Memorize; Mimic The Past: Federated Class Incremental Learning Without Episodic Memory

Babakniya, Sara; Fabian, Zalan; He, Chaoyang; Soltanolkotabi, Mahdi; Avestimehr, Salman (December 2023, Proceedings of Neural Information Processing Systems)

Deep learning models are prone to forgetting information learned in the past when trained on new data. This problem becomes even more pronounced in the context of federated learning (FL), where data is decentralized and subject to independent changes for each user. Continual Learning (CL) studies this so-called \textit{catastrophic forgetting} phenomenon primarily in centralized settings, where the learner has direct access to the complete training dataset. However, applying CL techniques to FL is not straightforward due to privacy concerns and resource limitations. This paper presents a framework for federated class incremental learning that utilizes a generative model to synthesize samples from past distributions instead of storing part of past data. Then, clients can leverage the generative model to mitigate catastrophic forgetting locally. The generative model is trained on the server using data-free methods at the end of each task without requesting data from clients. Therefore, it reduces the risk of data leakage as opposed to training it on the client's private data. We demonstrate significant improvements for the CIFAR-100 dataset compared to existing baselines.
more » « less
Full Text Available

« Prev Next »

Search for: All records